In this notebook I'll attempt to build a quick ML model (2-3 hours) to predict whether a mobile ad will be clicked, given a set of parameters. The data comes from the Kaggle competition at the following link: https://www.kaggle.com/c/avazu-ctr-prediction/overview/evaluation
We have somewhat large test and training files: the zipped training data alone is over 1 GB.
! ls -lh | grep '\.gz'
This is a good point to stop and discuss our strategy for processing the files (data engineering) and training our ML models. Since we have a somewhat large dataset, we've got the following options:

- pdpipe: a good choice for lightweight data pipelining in the pandas domain.
- Prefect: can be used for task scheduling and management (great for data pipelines).
- Spark (MLflow, MLlib, etc.): however, this involves creating Spark clusters, which might be time consuming; considering this will be a quick task (2-3 hours), I'll skip it for the time being.

Let's have a brief look at the contents of the train.gz and test.gz files.
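Even without pdpipe, the same lightweight-pipeline idea can be sketched with pandas' own `DataFrame.pipe()`. This is a minimal illustration, not pdpipe's API; the helper functions and column names (`drop_constant_columns`, `device_type`) are hypothetical.

```python
import pandas as pd

# Hypothetical cleaning steps, chained with DataFrame.pipe() --
# the same lightweight-pipeline pattern pdpipe formalizes.
def drop_constant_columns(df):
    """Drop columns that hold a single unique value."""
    return df.loc[:, df.nunique() > 1]

def encode_categoricals(df, cols):
    """One-hot encode the given categorical columns."""
    return pd.get_dummies(df, columns=cols)

# Toy stand-in for a slice of the CTR data
raw = pd.DataFrame({
    'device_type': ['a', 'b', 'a', 'b'],
    'constant': [1, 1, 1, 1],
    'click': [0, 1, 0, 1],
})

clean = (raw
         .pipe(drop_constant_columns)
         .pipe(encode_categoricals, cols=['device_type']))
print(clean.columns.tolist())
```

Each step stays a plain testable function, and the chain reads top to bottom like a pipeline definition.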
! gzcat train.gz | head | csvlook
! gzcat test.gz | head | csvlook
Looks like we need to predict the `click` column.
For convenience I'll extract the zipped files to make them easier for dask to read.
! gunzip test.gz && gunzip train.gz
! ls -lh | grep 'test\|train'
As expected, the uncompressed files are about 6x larger than the compressed versions.
There are a number of columns we could use to build our model, but before going any further, let's decide which ones. There are several things we could do here; I'd like to briefly check the distribution of values first. My strategy: use pandas-profiling for a quick visualization.
# housekeeping
import pandas as pd
import dask.dataframe as dd
from pandas_profiling import ProfileReport

pd.set_option('display.max_columns', 30)
Use dask to read and sample data
from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1, processes=False, memory_limit='2GB')
client
# read training lazily with dask
train = dd.read_csv('train', dtype={'id': 'float64'})
We've got ~40.4 million rows of training data.
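The row count above comes from dask evaluating the frame lazily; the same bounded-memory idea works in plain pandas by reading the CSV in chunks. A tiny sketch with an in-memory stand-in for the real file (the chunk size and toy data are illustrative only):

```python
import io
import pandas as pd

# Toy stand-in for the multi-GB train file; in practice you'd pass
# the real path. Counting per chunk keeps memory usage bounded.
csv = io.StringIO("id,click\n" + "\n".join(f"{i},0" for i in range(10)))

n_rows = sum(len(chunk) for chunk in pd.read_csv(csv, chunksize=4))
print(n_rows)  # 10
```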
%%time
# sample ~0.1% of the rows; dask samples without replacement by default,
# and at this scale the distinction makes little practical difference
s_train = train.sample(frac=0.001, random_state=42).compute()
s_train.head()
s_train.to_pickle('data/ctr_data.pickle', compression='gzip')
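Fixing `random_state` makes the sample reproducible across reruns, which matters since the pickle above becomes the working dataset. A small pandas sketch of that behavior (toy frame and fraction, not the real 40M-row data):

```python
import pandas as pd

# Small stand-in frame; the real call samples ~0.1% of ~40M rows.
df = pd.DataFrame({'id': range(1000), 'click': [0, 1] * 500})

a = df.sample(frac=0.1, random_state=42)
b = df.sample(frac=0.1, random_state=42)

print(len(a))                   # 100 rows: frac=0.1 of 1000
print(a.index.equals(b.index))  # same seed -> identical sample
```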
Let's have a quick look at the columns to decide which ones to use in our ML model.
I'll use pandas-profiling to get a nice view
s_train = pd.read_pickle('data/ctr_data.pickle', compression='gzip')
report = ProfileReport(s_train, title='Train Data Report', explorative=True)
# Save to file for offline viewing
report.to_file('Initial_Report.html')
# view report inside Jupyter Notebook
# NOTE: If you are viewing this .ipynb file on GitHub the profiling report won't be visible; please download the repo and open the outputted HTML file to see the report
report.to_notebook_iframe()